── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.2 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.3 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library("broom")
Load data
df <- read_tsv("../data/03_dat_aug.tsv")
Rows: 70692 Columns: 32
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (9): Smoking_Status, Diabetes_Status, Sex_character, Age_Range, Income_...
dbl (23): Diabetes_binary, HighBP, HighChol, CholCheck, BMI, Smoker, Stroke,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
This second analysis script performs PCA and builds three models. The first uses the principal components needed to explain 80% of the variance, combined with logistic regression for prediction. The second and third are separate GLMs for men and women. Comparing them lets us analyze whether gender is especially important when predicting diabetes.
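As a minimal sketch of the first model, the code below picks the smallest number of components whose cumulative explained variance reaches 80% and fits a logistic regression on their scores. It is an illustration under assumptions, not our actual pipeline: the built-in iris data stands in for our data set, and the outcome is_virginica and the names n_pc, scores and pc_model are hypothetical.

```r
library(dplyr)
library(broom)

# Assumption: iris is a stand-in for our data; the binary outcome is made up
pca_fit <- iris %>%
  select(where(is.numeric)) %>%
  prcomp(center = TRUE, scale. = TRUE)

# Smallest number of components whose cumulative explained variance is >= 80%
n_pc <- pca_fit %>%
  tidy(matrix = "eigenvalues") %>%
  filter(cumulative >= 0.80) %>%
  pull(PC) %>%
  min()

# Scores of the retained components plus the (hypothetical) binary outcome
scores <- as.data.frame(pca_fit$x[, seq_len(n_pc), drop = FALSE]) %>%
  mutate(is_virginica = as.integer(iris$Species == "virginica"))

# Logistic regression on the retained components
pc_model <- glm(is_virginica ~ ., data = scores, family = binomial())
tidy(pc_model)
```

The same glm() call, fitted separately on the male and female subsets, would be one way to obtain the second and third models.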
PCA
We used the code provided by the lecturers (PCA tidyverse) and applied it to our data to draw our own conclusions. Although the code follows theirs, the conclusions are specific to our data and therefore completely different. We want to highlight that this code helped us understand how Principal Component Analysis (PCA) works and how to interpret its results. Let's start with PCA.
PCA is a dimensionality reduction method often used on large data sets: it transforms a large set of variables into a smaller one that still contains most of the information in the original set. PCA can serve many purposes; here we apply it to build models with fewer variables and to try to draw more meaningful conclusions from them. In short, we use it to simplify a data set with many variables and make it more manageable for analysis.
Having said that, the first step is to restrict the data to numerical variables. PCA is typically applied to continuous variables and assumes they are measured on a numerical scale, because it is based on the covariance matrix, which requires computing variances and covariances between numeric variables. Directly applying PCA to categorical variables is therefore not appropriate. However, there are techniques that extend the idea to categorical data. One of them is Multiple Correspondence Analysis (MCA), which suits data sets whose variables are categorical and can take more than two levels. Since we run PCA only on the numerical variables, we do not pursue this approach.
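To make the link with the covariance matrix concrete, the short sketch below compares the eigenvalues of the covariance matrix of standardized data with the component variances reported by prcomp(). The built-in iris data is only a stand-in for our data set, and the object names are illustrative.

```r
# Assumption: the four numeric iris columns stand in for our numerical variables
num <- iris[, 1:4]

# Eigen-decomposition of the covariance matrix of the standardized data
eig <- eigen(cov(scale(num)))

# PCA on the same data, centered and scaled
pca_fit <- prcomp(num, center = TRUE, scale. = TRUE)

# The eigenvalues should match the squared standard deviations of the components
all.equal(eig$values, pca_fit$sdev^2)
```

Categorical columns have no variance or covariance in this sense, which is why they are left out before this step.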
After checking which variables in our data set are numerical, we select only those for the analysis. Before running PCA, we have to make sure all variables are on a level playing field. Scaling means adjusting the values so that they are all in a similar range, so that variables measured in large units do not dominate the components.
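The selection and scaling steps can be sketched as follows. This is a minimal example assuming the built-in iris data as a stand-in for our data set; the name pca_fit is illustrative.

```r
library(dplyr)

# Assumption: iris is a stand-in for our data set
pca_fit <- iris %>%
  select(where(is.numeric)) %>%          # keep only the numerical columns
  prcomp(center = TRUE, scale. = TRUE)   # center to mean 0, scale to unit variance

# Standard deviation of each principal component
pca_fit$sdev
```

Passing scale. = TRUE lets prcomp() standardize the variables itself, so no separate call to scale() is needed.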
Now we create a special kind of map of our data using PCA. To do this, we combine the PCA results with the original data, adding back the information we temporarily set aside. It is like bringing the colors back to our points, based on categories that were there in the first place but were removed for PCA. We use the augment() function from the “broom” package for this. It takes the fitted model and the original data as inputs.
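A minimal sketch of this step is shown below, again assuming the built-in iris data as a stand-in for our data set, with Species playing the role of the categorical variable set aside before PCA. The names pca_aug and p are illustrative.

```r
library(dplyr)
library(broom)
library(ggplot2)

# Assumption: iris is a stand-in for our data set
pca_fit <- iris %>%
  select(where(is.numeric)) %>%
  prcomp(center = TRUE, scale. = TRUE)

# augment() adds the component scores (.fittedPC1, .fittedPC2, ...) to the original data
pca_aug <- pca_fit %>%
  augment(iris)

# Map of the observations on the first two components, colored by the category
p <- ggplot(pca_aug, aes(.fittedPC1, .fittedPC2, color = Species)) +
  geom_point()
```

Because the original columns are carried along, any categorical variable in the data can be used for coloring without re-running the PCA.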